Class 30
DATA1220-55, Fall 2024
2024-11-18
Remaining Tools
- ANOVA test for the difference between 3+ unpaired means
- Pearson correlation test for dependence between 2 numeric variables
- Linear regression for dependence between 1 or more explanatory variables (numeric or categorical) and a numerical or binary (0/1) response variable (if time)
- Developing a research question with a testable hypothesis
- Communicating statistical methods and analysis results
- Data visualization tips & tricks, do’s & don’ts
- Statistical analysis best practice
ANOVA and the F-Test for Comparing 3+ Means
- The ANOVA test (Analysis of Variance) tests for a difference between the means \(\mu_i\) of \(k\) groups (\(k \ge 3\)).
- It compares the variability between the group means with the variability that would be expected from the spread of observations within the groups.
\[
\text{test statistic}=\frac{\text{Between-Groups Variability/Error}}{\text{Within-Groups Variability/Error}}
\]
- The p-value, the probability of a ratio at least as large as the one observed if the null hypothesis is true, is calculated using the \(F\) distribution.
- Rarely calculated by hand. We will perform these exclusively with R.
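To make the ratio above concrete, here is a minimal sketch in base R that computes the test statistic by hand and compares it with `aov()`. The data are a hypothetical toy example (made-up numbers, not course data), and the variable names are illustrative only.

```r
# Hypothetical toy data: three groups of four made-up measurements each.
y <- c(4, 5, 6, 5,   7, 8, 9, 8,   5, 6, 7, 6)
group <- factor(rep(c("A", "B", "C"), each = 4))

k <- nlevels(group)                      # number of groups
n <- length(y)                           # total number of observations
grand_mean  <- mean(y)
group_means <- tapply(y, group, mean)    # one mean per group
group_sizes <- tapply(y, group, length)  # one n_i per group

# Between-group variability: spread of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ms_between <- ss_between / (k - 1)

# Within-group variability: spread of observations around their own group mean
# (ave() returns each observation's group mean)
ss_within <- sum((y - ave(y, group))^2)
ms_within <- ss_within / (n - k)

f_by_hand <- ms_between / ms_within

# The same statistic from R's built-in one-way ANOVA
fit <- aov(y ~ group)
f_from_aov <- summary(fit)[[1]][["F value"]][1]

c(by_hand = f_by_hand, aov = f_from_aov)
```

For these toy numbers both approaches give \(F = 14\): the between-group mean square is \(28/3\) and the within-group mean square is \(2/3\).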
ANOVA F-Test Hypotheses
Null hypothesis: the mean outcome \(\mu_i\) is the same across all \(k\) groups, such that each group has the same population mean \(\mu\).
\[
H_0 \colon \mu_1=\mu_2=\dots=\mu_k=\mu
\]
Alternative hypothesis: at least one mean \(\mu_i\) differs from at least one of the other \(k-1\) means, so the groups do not share a single population mean \(\mu\).
\[
H_A \colon \mu_i \ne \mu_j \text{ for at least one pair } i \ne j
\]
Assumptions
- Independent & Identically Distributed Observations: Observations are independent both within and between groups and identically distributed within groups.
- Sample size: There are at least 30 observations in each of the \(k\) groups (\(n_i \ge 30\)).
- Normality: Data within each of the \(k\) groups are approximately normally distributed. This matters most when the \(n_i\) are small; the condition relaxes as \(n_i\) increases (\(n_i \to \infty\)).
- Equal variance: Within-group variance is approximately equal across the \(k\) groups. This condition relaxes as the sample sizes \(n_i\) become more balanced across the \(k\) groups (\(n_i \to \frac{n}{k}\), where \(n\) is the total sample size).
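The sample size and normality conditions can be examined directly in R. The sketch below uses simulated, hypothetical data (the data frame `dat` and its columns are made up for illustration) and is one possible way to look at these conditions, not a required procedure. Independence cannot be checked this way; it has to be judged from how the data were collected.

```r
# Hypothetical data: three simulated groups, NOT data from this course.
set.seed(1)
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = c(12, 15, 10))),
  y     = c(rnorm(12, mean = 5), rnorm(15, mean = 6), rnorm(10, mean = 5.5))
)

# Sample size: observations per group (compare with the n_i >= 30 guideline)
n_i <- table(dat$group)
n_i

# Normality: Shapiro-Wilk p-value within each group
# (a small p-value suggests a departure from normality)
shapiro_p <- tapply(dat$y, dat$group, function(v) shapiro.test(v)$p.value)
shapiro_p

# A visual check is also common: one normal Q-Q plot per group, e.g.
# qqnorm(dat$y[dat$group == "A"]); qqline(dat$y[dat$group == "A"])
```

With groups this small, the normality check carries more weight than it would with 30 or more observations per group.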
The F-Distribution
- The larger the test statistic \(F\), the less likely a value at least that large is under the null hypothesis
- Like the Chi-squared (\(\chi^2\)) distribution, the upper tail probability is always used for hypothesis testing
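In R, upper-tail probabilities of the \(F\) distribution come from `pf()` with `lower.tail = FALSE`, and critical values from `qf()`. The degrees of freedom below are hypothetical (3 groups, 30 observations in total), chosen only to illustrate how the tail probability shrinks as \(F\) grows.

```r
# Hypothetical degrees of freedom: k = 3 groups, n = 30 observations in total
df1 <- 3 - 1    # numerator df: k - 1
df2 <- 30 - 3   # denominator df: n - k

# Upper-tail probabilities P(F >= f) for increasingly large statistics
f_values <- c(1, 2, 4, 8)
p_upper <- pf(f_values, df1, df2, lower.tail = FALSE)
round(p_upper, 4)

# Critical value: the F beyond which the upper tail holds 5% of the distribution
f_crit <- qf(0.95, df1, df2)
f_crit
```

Any observed \(F\) larger than `f_crit` would have a p-value below 0.05 for these degrees of freedom.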
ANOVA Results Table
Example: Wolf River Sediments
- The Wolf River in Tennessee flows past an abandoned site once used by the pesticide industry for dumping wastes, including chlordane (pesticide), aldrin, and dieldrin (both insecticides)
- These highly toxic organic compounds can cause various cancers and birth defects
- The standard method to test whether these substances are present in a river is to take samples at six-tenths depth
- Since these compounds are denser than water and their molecules tend to stick to particles of sediment, they are more likely to be found in higher concentrations near the bottom
The Data
Exploratory Analysis & Sample Statistics
Equal Variance Assumption
![]()
Do these variances look approximately equal?
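A numeric companion to the visual check is to compare the group standard deviations directly. The sketch below uses simulated, hypothetical data rather than the Wolf River measurements; the data frame `dat` and the choice of Bartlett's test are assumptions made for illustration.

```r
# Hypothetical data: three simulated groups, NOT the Wolf River measurements.
set.seed(2)
dat <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 10)),
  y     = c(rnorm(10, 5, 1), rnorm(10, 6, 1.2), rnorm(10, 5.5, 0.9))
)

# Standard deviation within each group
sds <- tapply(dat$y, dat$group, sd)
sds

# Ratio of the largest to the smallest standard deviation;
# values near 1 indicate similar spread across groups
sd_ratio <- max(sds) / min(sds)
sd_ratio

# Bartlett's test: the null hypothesis is that all group variances are equal
bt <- bartlett.test(y ~ group, data = dat)
bt$p.value

# Side-by-side boxplots give the visual version of the same comparison
# (uncomment to draw):
# boxplot(y ~ group, data = dat)
```

Bartlett's test is itself sensitive to non-normal data, so it is best read alongside the plot rather than in place of it.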
ANOVA Results
Interpreting The Results
- The test statistic \(F\) was 6.14, meaning the between-groups variability was 6.14 times as large as the within-group variability.
- The p-value was 0.0063, meaning that if the null hypothesis were true (each mean is the same, \(\mu_i=\mu\)), an \(F\) at least this large would occur in only about 0.63% of samples of these sizes.
- The p-value is less than a significance threshold of \(\alpha=0.05\), so we would reject the null hypothesis that the means of the 3 groups are equal. The mean of at least one group differs from the other means.
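- But this conclusion relies on some pretty strong assumptions (independent observations, normality within groups, equal variance), so it is only as trustworthy as those assumptions.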
- If you want to know which means are different from each other, you will have to do additional pairwise tests between group means.
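One way to carry out the pairwise follow-up in R is sketched below, using `TukeyHSD()` and `pairwise.t.test()` from base R. Both adjust the p-values for the fact that several comparisons are made at once. The data are a hypothetical toy example (made-up numbers), not the Wolf River data, and the choice of these two procedures is an assumption for illustration.

```r
# Hypothetical toy data: three groups of four made-up measurements each.
y <- c(4, 5, 6, 5,   7, 8, 9, 8,   5, 6, 7, 6)
group <- factor(rep(c("A", "B", "C"), each = 4))

fit <- aov(y ~ group)

# Tukey's HSD: every pairwise difference in means, with confidence
# intervals and p-values adjusted for the family of comparisons
tukey <- TukeyHSD(fit)
tukey$group

# Alternative: pairwise t-tests with a Holm correction
pw <- pairwise.t.test(y, group, p.adjust.method = "holm")
pw$p.value
```

Each row of `tukey$group` is one pair of groups (for example `B-A`), so the rows with small adjusted p-values indicate which means differ.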